Exploring Red Wine Quality

Introduction

This project is an exploratory analysis, in R, of the wineQualityReds dataset. It uses univariate, bivariate and multivariate plots to find the properties that affect the quality of wine.

## [1] "/home/sumit/Desktop/data_analyst_nanodegree/EDA_R_P4_f"

Since new variables will be added to the data, a copy of the dataset is created so that the original stays unchanged and the analysis is easier.
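A minimal sketch of the load-and-copy step. The file name, the helper `load_wine()` and the dropping of the identifier column `X` are assumptions for illustration, not the code used in this report. The demo writes a two-row temporary file (values taken from the first rows shown in the overview) so that the block runs on its own.

```r
# Hypothetical helper: read the CSV and drop the row-identifier column X if present.
load_wine <- function(path) {
  raw <- read.csv(path)
  raw[, setdiff(names(raw), "X"), drop = FALSE]
}

# Demo on a tiny temporary file so the sketch is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("X,alcohol,quality", "1,9.4,5", "2,9.8,5"), tmp)

wine_original <- load_wine(tmp)
wine <- wine_original                       # working copy
wine$quality.factor <- factor(wine$quality) # new variables go on the copy only

names(wine_original)  # the original is left unchanged
names(wine)
```

In R, assigning a data frame to a new name behaves like a copy: later changes to `wine` do not alter `wine_original`.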

Overview of Data

Size of the Dataset

## [1] 1599   12

The dataset contains 12 variables and a total of 1,599 observations.

Features in the dataset and their types.

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The mean wine quality is 5.636 and the median is 6, so the mean and median of quality are quite close.

Univariate Plots Section

The histograms show that pH, density and quality are approximately normally distributed, while other variables are right-skewed (most values are low, with a long tail to the right). Some variables have outliers, such as the sulphur-related variables, chlorides and residual sugar. Citric acid is the only variable that contains zero values.

Quality Review

It shows that there are six numerical quality scores in the dataset, ranging from 3 to 8, and that most wines have a quality of 5 or 6.

Factoring the quality variable for better plots

Converting quality to a factor variable makes it easier to run the analysis.
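One way the conversion could look, shown on the first ten quality scores from the overview. The ordered factor and the explicit `levels = 3:8` are choices made for this sketch, not necessarily what the report used.

```r
# First ten quality scores from the overview above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Ordered factor over the score range 3..8, so plots keep the natural order
# and scores with no observations still appear as (empty) levels.
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

table(quality_f)
```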

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

New Features

##     low  medium highest 
##      63    1319     217

The wine quality is converted into a rating with three levels (low, medium and highest) for better analysis.
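A sketch of how such a rating can be built with `cut()`. The grouping of scores (3-4, 5-6, 7-8) follows the Reflection section; the variable names and break points are assumptions of this example.

```r
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are (2,4], (4,6], (6,8]:
# scores 3-4 -> low, 5-6 -> medium, 7-8 -> highest.
rating <- cut(quality,
              breaks = c(2, 4, 6, 8),
              labels = c("low", "medium", "highest"))

data.frame(quality, rating)
```

Applied to the full quality column, `table(rating)` should give the counts shown above.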

Sulphates, chlorides and residual sugar were found to have the largest numbers of outliers.

These variables are log-transformed so that their distributions look closer to normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

On the transformed scale the distribution now looks approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

This distribution also looks approximately normal after the transformation.
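To illustrate why a log scale helps with right-skewed variables, the sketch below compares sample skewness before and after a `log10` transformation. The data are synthetic (a log-normal sample), not values from the wine dataset, and both the `skewness()` helper and the choice of base 10 are assumptions of this example.

```r
set.seed(42)
# Synthetic right-skewed sample standing in for a variable such as chlorides;
# these are NOT values from the wine data.
x <- rlnorm(1000, meanlog = -2.5, sdlog = 0.4)

# Simple moment-based sample skewness (hypothetical helper, base R only).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)         # clearly positive: long right tail
skew_log <- skewness(log10(x))  # close to zero: roughly symmetric

c(raw = skew_raw, log10 = skew_log)
```

A histogram of `log10(x)` would show the same thing visually: the long right tail is pulled in and the shape becomes roughly bell-like.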

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

There are some outliers in pH.
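As a rough check, the usual 1.5 × IQR rule (the default whisker length of R's `boxplot()`) can be applied to the quartiles printed above. This is a sketch using the summary values only; a box plot on the raw data uses hinges, which can differ slightly from these quartiles.

```r
# Quartiles and extremes of pH, copied from the summary above.
q1 <- 3.21
q3 <- 3.40
ph_min <- 2.74
ph_max <- 4.01

iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)
c(min_is_outlier = ph_min < lower_fence,
  max_is_outlier = ph_max > upper_fence)
```

The fences come out at about 2.925 and 3.685, so both the minimum (2.74) and the maximum (4.01) fall outside them.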

Box plot of citric acid before removing missing (NA) values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Box plot of citric acid after removing missing (NA) values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

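The box plot and the summary are unchanged after the removal step (the minimum is still 0), so no rows were actually dropped. This indicates that citric acid has no NA entries and that its zeros appear to be recorded measurements rather than missing data.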
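One way to tell zeros apart from missing values is to count them separately. The sketch below uses the first ten citric.acid values from the overview; the variable names are illustrative.

```r
# First ten citric.acid values from the overview above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

n_missing <- sum(is.na(citric))  # genuinely missing entries
n_zero    <- sum(citric == 0)    # recorded zeros

c(missing = n_missing, zero = n_zero)

# na.omit() drops only NA entries, so zeros survive it.
length(na.omit(citric)) == length(citric)
```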

Univariate Analysis

What is the structure of your dataset?

The dataset has 1,599 wine observations. The raw file has 13 numeric variables: X, a unique row identifier, and 12 features: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality. The overview above lists only these 12 features, without X.

Quality is the output variable and all others are the input variables.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. The goal is to find out how the other variables influence the quality of the wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The following variables will support the investigation, as they have a noticeable effect on the quality of wine:

1. alcohol
2. sulphates
3. citric.acid
4. volatile.acidity

Did you create any new variables from existing variables in the dataset?

Yes, the outcome variable quality was converted into three levels (low, medium and highest) for better analysis of the data.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distributions of chlorides and residual sugar are highly right-skewed, so a log transformation was applied to bring them closer to normal. Quality was also converted to a factor to make its analysis easier.

Bivariate Plots Section

Calculating the relationships between variables using correlation values.

The correlation coefficients indicate the strength of the bivariate relationships. Alcohol has the strongest correlation with quality, followed by volatile acidity, sulphates and citric acid.

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

Features that are positively correlated with quality are:

alcohol:quality = 0.48
sulphates:quality = 0.25
citric.acid:quality = 0.23
fixed.acidity:quality = 0.12
residual.sugar:quality = 0.01

Features that are negatively correlated with quality are:

volatile.acidity:quality = -0.39
total.sulfur.dioxide:quality = -0.19
density:quality = -0.17
chlorides:quality = -0.13
pH:quality = -0.06
free.sulfur.dioxide:quality = -0.05
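The ranking above can be reproduced by sorting the quality column of the correlation matrix by absolute value. The sketch below copies that column from the printed matrix; the vector name `r_quality` is illustrative.

```r
# Correlations with quality, copied from the matrix above.
r_quality <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Order by absolute value so strong negative relations rank
# alongside strong positive ones.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]
round(ranked, 2)
```

Sorting by absolute value rather than by signed value puts volatile.acidity second, directly after alcohol.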

The plots above support the correlation between quality and alcohol: higher-quality wines have greater alcohol content, and alcohol has the highest correlation with quality (0.48).

This plot shows how the different acids relate to wine quality. Citric acid is more strongly correlated with quality (0.23) than fixed.acidity is (0.12), while volatile.acidity has the strongest negative correlation with quality (-0.39).

Quality is higher for wines with low volatile acidity, since volatile acidity and quality are negatively correlated (-0.39).

Citric acid and sulphates are moderately correlated (0.31), while alcohol vs sulphates (0.094) and alcohol vs citric acid (0.110) are only weakly correlated.

pH has a very small correlation with quality (-0.058). This suggests that higher-quality wines have slightly lower pH, but according to the plot many medium-quality wines also have low pH, which may be due to outliers.

pH increases as fixed acidity decreases, since the two are negatively correlated at

## [1] -0.68
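A single pairwise correlation like this one can be obtained with `cor.test()`. The sketch below runs it on the first ten rows from the overview only, so its estimate differs from the full-data value of -0.68; whether the report used `cor.test()` or `cor()` for this number is not shown, so treat the call as an assumption.

```r
# First ten rows from the overview above; the full data give about -0.68.
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
pH            <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)

# Pearson correlation test: $estimate holds r, $p.value the significance.
res <- cor.test(fixed.acidity, pH, method = "pearson")
r <- unname(res$estimate)
round(r, 2)
```

Even on this small subset the sign is negative, in line with the full-data result.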

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From the correlation matrix, it appears that fixed acidity, citric acid, sulphates and alcohol are positively correlated with wine quality, and volatile acidity and pH are negatively correlated. The individual correlation tests showed similar trends, except that pH showed very little correlation.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

In the correlation matrix, fixed acidity and density are positively correlated: wines with higher fixed acidity tend to have higher density. Volatile acidity is negatively correlated with citric acid, which is an interesting observation. pH has little impact on wine quality, given its small correlation.

What was the strongest relationship you found?

The strongest relationship by magnitude is the negative one between fixed.acidity and pH (-0.68). The strongest positive relationships are between fixed.acidity and citric.acid (0.67), between fixed.acidity and density (0.67), and between free.sulfur.dioxide and total.sulfur.dioxide (0.67).
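Finding the strongest pair means looking at the largest absolute off-diagonal entry of the correlation matrix. The sketch below does this on a four-variable sub-matrix copied from the matrix printed earlier (rounded to four decimals); the names `m`, `m_upper` and `strongest` are illustrative.

```r
vars <- c("fixed.acidity", "citric.acid", "density", "pH")

# Sub-matrix copied from the correlation matrix above (rounded to 4 decimals).
m <- matrix(c( 1.0000,  0.6717,  0.6680, -0.6830,
               0.6717,  1.0000,  0.3649, -0.5419,
               0.6680,  0.3649,  1.0000, -0.3417,
              -0.6830, -0.5419, -0.3417,  1.0000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Ignore the diagonal and the duplicated lower triangle.
m_upper <- m
m_upper[lower.tri(m_upper, diag = TRUE)] <- NA

# Position of the largest absolute correlation among the remaining entries.
idx <- which(abs(m_upper) == max(abs(m_upper), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(m)[idx[1, "row"]], colnames(m)[idx[1, "col"]])

strongest
m[strongest[1], strongest[2]]
```

Using `abs()` matters here: a search for the maximum signed value would skip negative relationships such as the one between fixed.acidity and pH.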

Multivariate Plots Section

The features of interest identified in the bivariate plots are explored further here.

This plot suggests that both alcohol and sulphates contribute to good wine.

This plot gives a clear picture of good wine (low volatile acidity and high alcohol) and poor wine (high volatile acidity and low alcohol).

This plot gives a clear picture of good wine (high citric acid and high alcohol) and poor wine (low citric acid and low alcohol), in line with the positive correlation between citric acid and quality.

pH doesn’t have a large impact on wine quality.

Effect of alcohol and volatile acidity on the extreme wine qualities.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In the multivariate analysis, the relationship between volatile.acidity and alcohol is the most informative. The plot separates good wine (low volatile acidity and high alcohol) from poor wine (high volatile acidity and low alcohol).

Were there any interesting or surprising interactions between features?

Since alcohol, specifically ethanol, is a weak acid, I expected it to be somewhat correlated with other acids such as citric acid. The plot of alcohol against citric acid above shows that the two are nearly uncorrelated (0.11). Also, because pH spans a small range (about 2.7 to 4.0), it shows little effect on wine quality.

Final plots and summary

Plot 1

Correlation scores:

citric.acid:quality = 0.23
fixed.acidity:quality = 0.12
residual.sugar:quality = 0.01

This plot shows that better-quality wines tend to contain more citric acid, in line with the positive correlation (0.23). Low volatile acidity also contributes to higher wine quality.

Plot 2

Shows the relationship between alcohol and quality (correlation 0.48). Since this is the largest correlation with quality, alcohol is the variable most strongly associated with wine quality.

Plot 3

Effect of alcohol and volatile acidity on the extreme wine qualities (the correlation between alcohol and volatile.acidity is -0.202). High volatile acidity combined with low alcohol content goes with lower wine quality, and vice versa.

Reflection

The analysis began by loading the dataset and getting an overview of the data. The first part was a univariate analysis, in which histograms of the distributions of all variables were plotted. The quality variable was also converted into a factor variable with levels, which helped in analyzing it.

I had difficulty reading the correlation plot produced by corrplot(), so I calculated the correlation values separately to analyze the data better. A categorical rating variable was also created; it assigns each wine one of three grades: low (3-4), medium (5-6) and highest (7-8).

Log transformations were applied to variables such as chlorides and residual sugar because their distributions were highly skewed.

Correlation coefficients were computed to study the relationships between all variables and the effect of each variable on quality. The variables with the strongest correlation with quality are alcohol, volatile acidity, sulphates and citric acid. Plots were then drawn for these variables to study their effects separately.

The multivariate analysis explored interactions between the variables and examined where the high-quality wines fall, in order to establish relationships.

Wine quality is highly subjective and depends on individual taste. A better study would include the quantities of each wine sold in the market. A predictive model could also be built to predict wine quality.
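As a starting point for such a model, a linear regression on the two variables most correlated with quality could look like the sketch below. It is fitted on the first ten rows from the overview only, so the coefficients are not meaningful. The choice of `lm()`, of the predictors, and of treating quality as numeric are all assumptions of this example, not results of this analysis.

```r
# First ten rows from the overview above; far too few for a real model.
wine10 <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Ordinary least squares, treating quality as a numeric response (an assumption).
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine10)
coef(fit)

# Predicted quality for a hypothetical new wine.
new_wine <- data.frame(alcohol = 10, volatile.acidity = 0.5)
pred <- predict(fit, newdata = new_wine)
pred
```

On the full dataset the same call would use all 1,599 rows, and a train/test split would be needed to judge how well the model predicts.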
